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Abstract 

In the literature surrounding Bayesian penalized regression, the two primary choices of 
prior distribution on the regression coefficients are zero-mean Gaussian and Laplace. While 
both have been compared numerically and theoretically, there remains little guidance on which 
to use in real-life situations. We propose two viable solutions to this problem in the form of prior 
distributions which combine and compromise between Laplace and Gaussian priors, respectively. 
Through cross-validation the prior which optimizes prediction performance is automatically 
selected. We then demonstrate the improved performance of these new prior distributions 
relative to Laplace and Gaussian priors in both a simulated and experimental environment. 



1 Revivification of work presented at the CMS-MfTACS Joint Conference, May 31 to June 3, 2007 (0). Since this 
time, considerable effort has been made on these and related models (i.e. [7J, [3], and [B]). Our goal in producing 
this technical report is simply to make more readily available (in comparison to the poster format of the original) 
our initial contribution (|2]). This document is Technical Report #254, Department of Statistics, The University of 
British Columbia. 
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1 Introduction 

Assume we have a data set of p-dimensional input vectors {x n }^ =1 and corresponding response 
variables {y n }n=i- While we study univariate responses for simplicity of presentation, the methods 
underlying penalized regression easily extend to multivariate responses. We generally assume that 
the response is a noisy realization of some functional relationship y n = f(x n ,f3) + e n , where e n 
is i.i.d. noise and (3 is a vector of regression weights. Many common models fall under the class 
of functions that are a linear sum of M basis functions <fii( x n) {f(x n ,(3) = YliLifii c t ) i{ x n))- For 
example, simple linear regression corresponds to the case M = p and 4>i(x n ) = x n> i, the i th 
component of x n . Support vector machines correspond to M = N and (f>i(x n ) = K(xi,x n ) for 
some suitable choice of kernel, K. Although the form of the models are similar, different solutions 
may be reached through different model fitting techniques. 

An often used non-Bayesian method for estimating the coefficients (3 is to minimize the squared 
error between the model f(x n ,/3) and the response y n , 

N 

& = m i n (/O^ 0) ~ 2/™) 2 • 

P n=l 

This estimate of (3 then allows us to predict future responses vn+i based on xn+i using the model 
f(xN + i,/3). Although this framework is simply extended to situations of non-continuous responses 
using, for instance, a logit link, we continue to focus on regression for the sake of clarity. 

It is well known that complex and flexible models based on the least-squares estimate of (3 often 
over-fit the data. In fact, as the number of basis functions (and hence parameters) grows, we can 
fit the data arbitrarily well. Although an approach to prevent over-fitting is to use simpler models 
which don't capture the features of the data as accurately, the lack of flexibility will continue to 
lead to poor prediction performance. A more often applied approach to over- fitting is to regularize 
the estimate of (3 by shrinking it towards zero, resulting in a smoother functional form with better 
extrapolation performance. Typically this regularization is accomplished by penalizing large values 
of the regression coefficients, for instance using an optimization problem of the form 

N 

(3 = min J] (f(x n , (3) - y n f + Ar?(/3) 

P 71=1 
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where rj(f3) is the L\ or Li norm of (3 anci S?=i l/^J 2 ' respectively) which penalizes 

large values of the parameters The constant A, which is typically chosen with cross-validation, 
controls the trade-off between least-squares (A = 0) and more shrunken estimates of /3. 

The focus of this work is on Bayesian methods for penalized regression. While the standard 
prior distribution used is zero-mean Gaussian, much recent work has been done on an alternative 
choice - the Laplace distribution ([8], [5]). One of the primary advantages of this new prior is 
stronger shrinkage towards zero of the weakly related parameters. Zou and Hastie ([H]) have 
shown that the lasso ([TO]), which corresponds to a MAP estimator using a Laplace prior, lacks a 
grouping effect - the ability of the method to ensure highly correlated variables are assigned similar 
regression coefficients. As a result, they propose the elastic net, which by simultaneously penalizing 
both the L\ and L2 norms of the regression coefficients, has many of the nice properties of the lasso 
while also exhibiting a grouping effect. 

Our premonition is that the Laplace prior will suffer from a lack of grouping effect even when 
using point estimates other than the MAP estimator, such as the posterior mean or median. As a 
result, we propose two new priors which seek to address this problem, one of whose MAP estimator 
corresponds to bridge regression ([!]) and the other to the elastic net ([H]). In Section 2 we 
make explicit the problem of penalized regression from a Bayesian perspective and examine two 
commonly used priors. Our two alternative prior distributions are introduced in Section 3. All four 
prior distributions are compared in Section 4 using a simulated example and in Section 5 using 
experimental data. We conclude with discussion and closing remarks in Section 6. 

2 Bayesian Penalized Regression 

If we assume that the errors e n are distributed as Gaussian random variables with mean and 
variance a 2 , then the likelihood of the data {y n , x n ; n = 1, • • • , N} will be 



x y I i=l ) 

From a Bayesian perspective, we wish to assume some initial structure on (3, then use the data (in 
the form of the likelihood) along with Bayes' theorem to update our knowledge of f3 in the form of a 
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posterior distribution. We may then use some function of the posterior, such as the mean or median, 
as our estimate (3. As in the non-Bayesian approach, we seek to shrink the regression coefficient 
estimates toward zero to improve prediction performance and generalizability. The natural way 
to do this is to use prior distributions for (3 which are focussed around zero. The two primary 
prior distributions employed for this purpose are the Gaussian and Laplace distributions. While 
the Laplace prior has been extensively studied from the viewpoint of improving identification or 
prediction of models over the conjugate Gaussian prior, our focus is on finding a prior distribution 
which exhibits a grouping effect while maintaining excellent prediction and identification. In all 
subsequent developments we employ an inverse Gamma prior on a 2 with parameter vector (a, b) 
and assume the function f(x n ,(3) to take a linear form, f(x n ,(3) = X/3. Here X is the N by p 
matrix of predictors x n ;n = 1, . . . , N. We also let y denote the length N vector of responses y n . 

2.1 Gaussian priors 

In order to induce shrinkage in the estimate of /3, we focus our prior distribution around zero. One 
such option is to use a Gaussian distribution: 

We can adjust the amount of shrinkage induced on /3 by varying the prior covariance matrix X. 
Using this prior, the joint posterior of /3 and a 2 is 

tt(/3, a 2 \y, X) oc ((J 2 ) -(iV +P )/2- a -i exp j _ _i_ [(y _ X/3) T (y _ X/3) + ^-^T + 2b] j 

Because of the conjugate nature of this prior, we can explicitly obtain the marginal posterior 
distribution of /3, namely a student-t distribution with N + 2a degrees of freedom and parameters 

£= (S- 1 + X T X)- 1 ((X T X)^) 

e = ^^MWp-, + (x , x)rl . 

N + 2a " 
where /3 = (X T X)~ 1 X T y and s 2 = (y - X/3) T (y - X/3). 
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This marginal demonstrates the influence of S on the posterior estimate of (3. Some typical 
choices are E = cl, a scaled identity matrix (as in ridge regression), and S = 5(X T X)-\ cor- 
responding to Zellner's g-Prior. Because of its ability to automatically estimate the correlation 
structure in (3 and the ability to control shrinkage with the tuning parameter g, we will focus on 
the g-Prior when comparing this prior to other alternatives. As in the non-Bayesian version of 
penalized regression, the parameter g may be set to maximize prediction performance by using 
cross-validation. A benefit of a Gaussian prior is the explicit derivation, allowing for fast and 
efficient cross-validation and analysis. We subsequently look at Laplace priors, which while not 
admitting a closed-form solution, exhibit convenient shrinkage properties. 

2.2 Laplace priors 

Alongisde the wildly popular lasso, Laplace priors have been used on (3 with the convenient fact 
that the MAP estimate of (3 corresponds to the lasso solution. A Laplace prior distribution has 
the form 



Following [8|, we parametrize this distribution with X/ya 2 instead of the more traditional A. As 

pointed out in [8], this normalization provides for a unimodal posterior, allowing for easier use 
of posterior approximation methods, namely MCMC. Like the Gaussian distribution, the Laplace 
distribution can be changed to have different variance around zero by adjusting A. However, in this 
case increasing A results in shrunken (3. Using this prior, the joint posterior of (3 and a 2 is 



Although we could use a Metropolis-Hastings algorithm to sample from this joint posterior, we can 
be more clever, exploiting the representation of a Laplace distribution as an infinite mixture of 
Gaussians, namely 





7T 
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(1) 
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Using this representation, we may introduce latent variables r 2 , . . . , t? and employ a Gibbs sampler. 
The latent variables may be integrated out of the final joint posterior to give the correct marginal 
distributions of f3 and a 2 . To implement the Gibbs sampler, the full conditionals are 

(3\a 2 , rl, . . . , r 2 , y, X ~ N ((D/ 1 + X T X)- 1 ((X T X) / 3), <r 2 (D T - 1 + X T X)~ 
a 2 \(3, rl . . . , r 2 , y, X ~ IG (n/2 + p /2 + a,b + J^(f(x n , (3) - y n f : /2 + A^D^ 1 (3/2^ 



l/r 2 \P,a 2 ,y,X. ~ IGauss \u = J^pA' = A 



\2 







where D T is a diagonal matrix with elements r 2 , . . . , r 2 . Once again cross-validation may be used 
to select A. However, using MCMC to perform cross-validation is typically not computationally 
feasible, and hence alternatives must be used (11]). Because our focus lies on grouping and prediction 
properties, an approximately optimal choice of parameter will suffice, hence we use the LARS 
algorithm (corresponding to the MAP estimate) to obtain an approximately optimal choice of A. 



3 Combined and Compromise Priors 

We have described the two primary choices of prior distribution for performing penalized regression 
in a Bayesian setting. Regardless of how g and A are set, both priors result in different posterior 
distributions 7r(/3|y,X), each with their own benefit and performance gains for different situations. 
Thus it would be nice to have a way to automatically select which prior to use. We attempt to 
address this problem, and in the process do even better. We introduce two priors for (3, one which 
is a compromise between a Laplace and Gaussian prior, and the other which is a combination of 
the two, and automatically detects the proportion of each to use as the prior (including the limiting 
cases of purely Laplace and purely Gaussian). 

3.1 L q prior 

The first alternative choice of prior corresponds to bridge regression in the classical scenerio, where 
a compromise is found between ridge regression and lasso by controlling the exponent on the penalty 
term. This form of prior has previously been suggested in the literature ([8], [2]). The corresponding 
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prior distribution is 

ir(j3\a 2 ) oc ( ~2 j exp < 




g 6(1,2) 



Although this distribution can also be treated as an infinite mixture of Gaussians, the result- 
ing full conditionals do not adhere to well-known distributions, and hence a Gibbs sampler is 
not easily implemented. However, the full conditional for a 2 remains an inverse gamma, and so 
we are able to use Metropolis-Hastings within Gibbs, sampling a new f3 (/3*) from a multivari- 
ate Gaussian with mean /3 and variance c" 2 (X :r X) _1 , and accepting this sample with probability 
ex P {~2^ (Sj=i I/ 3 *i I" ~ Ej=i \Pf~ 1} \ q ) }> otherwise keeping (3 {t ~ l \ Here (3 {t - l) is the sample 
at the previous step of the Markov chain. Thus to implement this prior requires sampling from 
well-known distributions and calculating a simple acceptance ratio. It is worth noting that as A 
grows towards infinity, the acceptance ratio decreases to zero, thus more clever proposal distribu- 
tions must be used. One such possibility is to use the posterior from the Laplace and Gaussian prior 
discussed in sections 2.1 and 2.2 as the proposal. Because of the computational expense associated 
with Metropolis-Hastings, we use the MAP estimate (obtained using Newton-Raphson) to perform 
cross-validation to obtain approximately optimal choices of q and A. 

3.2 Bayesian elastic net 

Much like the MAP estimates from using the previous 3 priors corresponds to ridge regression, the 
lasso, and bridge regression, the subsequently presented prior has a MAP estimate corresponding to 
the elastic net ([H]), hence the name. This new prior takes the form of a mixture of Gaussian and 
Laplace prior distributions, and in fact contains both as special cases. In addition, the resulting 
posterior distribution is obtainable through a Gibbs sampler, which from our experience converges 
quickly. Specifically, the prior has the form 



7r(/3) oc exp < 



Ai , „, A9 



a i=i 



> . 



By once again representing the Laplace distribution as a mixture of Gaussians ([I]) , we can employ 
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a Gibbs sampler. By again using latent variables, the full conditionals for the Gibbs sampler are 



f3\a 2 , r 2 , . . . , r 2 , y, X ~ N ((Dr 1 + X T X + A 2 I)- 1 ((X T X)£), ^(D/ 1 + X T X + A 2 I)- ] ' 
a 2 \(3, t?, . . • , r 2 , y, X ~ IG (n/2 + p/2 + a, b + (y - X/3) T (y - X/3)/2 + A 1/ 3 T D T - 1 /3/2 + X 2 (3 T (3/2) 



1/t 2 |/3, <r 2 , y, X ~ IGauss [ u = \ j ^ 2 > ^' = ^i 



As before, the MAP estimate (obtained with the LARS algorithm) may be used with cross- 
validation to obtain approximately optimal choices of Ai and A 2 . One caveat is that both this 
prior and the previous (Lq) contain two tuning parameters, meaning that cross-validation must 
be performed over a 2-dimensional grid, increasing computational time relative to the Laplace or 
Gaussian prior. Figure [T] presents the four prior distributions presented for various parameter 
values. Note that the Laplace and Gaussian priors are limiting cases of the L q prior. 

4 Comparison on Simulated Data 

To demonstrate the performance of the priors we conduct a simulation study, generating a dataset 
containing 12 observations with X (of dimension 10) from a Gaussian distribution with mean 
and variance 1. Variables (3,4) and (5,6) have correlation 0.85 and 0.95, respectively, and the 
remainder are independent. The response is generated using the relationship y = X/3 + e where the 
first 6 components of j3 are {.5, —.5, .5, .5, —.5, —.5} and the remainder at 0. We employ Gaussian 
noise on the observations, specifically e ~ N(0, 1). We then generate an additional 50 observations 
as a testing set to check each method's performance. All variables are standardized to have mean 
zero and variance 1. Lastly, we repeat the process 1500 times. Following this, the entire experiment 
is repeated with X having dimension 100 instead of 10 to look at the effective shrinkage of each 
method. The two measures of performance we use are mean squared prediction error, as well as 
a unique measure we term "grouping error" (GE). Specifically, this is the mean squared error in 
predicting the regression coefficients for the 4 highly correlated variables. If from a pair of correlated 
variables a method finds largely different values for the regression coefficients, the grouping error 
will be quite large. All methods relying on Monte Carlo approximations used Markov chains of 
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Figure 1: Prior distributions for various settings of tuning parameters. Top: L q prior. Bottom: 
Bayesian elastic net prior 
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length 50, 000 with 10, 000 burn-in. Despite using MAP estimation to tune the models with cross- 
validation, this experiment was computationally expensive on the order of days. We observed 
similar computation times for all of the methods relying on Monte Carlo approximations. On the 
opposite end of the spectrum, both the least squares and Gaussian prior solutions required less 
than 1 second to compute. 

We see in Figure [2] that when p < N, specifically p = 10, N = 12, the Gaussian prior and 
the two elastic net solutions (frequentist and Bayesian) result in the best prediction performance. 
However, when p = 100, the Gaussian prior shows a significant decrease in performance, suggesting 
its inability to provide adequate shrinkage on the regression coefficients. While all the methods have 
reduced performance with the additional nuisance variables, the Gaussian prior is noticeably more 
influenced by the increased dimensionality than the rest. We also notice that the lasso solutions 
(and to a slightly lesser extent the L q prior) have unreliable grouping properties. For instance, a 
GE value of 100 means that at least one of the correlated variables had an estimated regression 
coefficient which was 10 off the true value. Considering that the true values were all ±0.5, this is 
a considerable estimation error. We notice only a slight degradation in grouping performance with 
increasing dimension of the predictor matrix, indicating that grouping effect and sparsity level are 
not closely connected. 

5 Comparison on Real-Life Data 

The prostate data of Stamey et al. ([9]) will be used in order to facilitate comparison with earlier 
work (pi)], [1]). This study examined the level of prostate specific antigen as correlated with a 
number of clinical measures. Consisting of 97 observations, we split the data into training (60 
observations) and testing (the remaining 37) sets to check each method's performance. As in 
the above simulation study, we standardize the data. In order to get a more robust measure of 
performance on the training set, we cycle through 100 random divisions of the data into training 
and testing sets. The MSE on these 100 testing sets is shown in table [T] From this we see that 
all of the methods have similar prediction performance, with the Bayesian method with Lq prior 
outperforming the others. It is interesting to note that after the Lq prior, the two elastic net 
solutions provided the best prediction. As with the simulation experiment, least squares provided 
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Figure 2: Boxplot of Errors on Simulated Data. LS = Least Squares, L = Lasso, EN = Elastic 
Net, BG = Bayes with Gaussian Prior, BL = Bayesian Lasso, BLq = Bayes with Lq Prior, BEN 
= Bayesian Elastic Net 
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Least 
Squares 


Freq 
Lasso 


Freq 
ENet 


Bayes 
Gaussian 


Bayes 
Lasso 


Bayes 
Lq 


Bayes 
ENet 


0.3908 


0.3888 


0.3862 


0.3892 


0.3875 


0.3838 


0.3874 



Table 1: MSE for different prediction methods 



the worst prediction, which is sensible due to the flexibility of the alternatives and the cross- 
validation we employ to tune each. 

6 Conclusions 

After describing the Laplace and Gaussian prior distributions used in penalized regression, we sub- 
sequently proposed two alternatives which trade off between these two. In fact, the two priors 
proposed above contain Laplace and Gaussian priors as a special case. Through simulation and ex- 
perimental results, we observed that different priors might be recommended in different situations. 
When computational expense is a major concern, Gaussian priors are extremely convenient in al- 
lowing for a closed-form solution while also exhibiting a grouping effect, although their performance 
degrades with large numbers of nuisance variables. 

Because penalized regression is often embedded into more complex problems where a Gibbs 
sampler (or Metropolis within Gibbs) is already being used, alternative priors should be considered. 
In these situations, we have found the Bayesian Elastic Net to be among the leaders both in terms 
of prediction performance and grouping effect. Based on our experiences, the L q prior distribution 
introduced in Section 3 is competitive with the others, but due to the requirement of a Metropolis 
step in its computation, we prefer the Bayesian Elastic Net. However, further work may show 
situations where the Lq prior is to be preferred. In conclusion, we have found the Bayesian lasso and 
elastic net to exhibit similar performance as their frequentist counterparts. Because of its excellent 
prediction performance and grouping effect, we recommend the Bayesian elastic net in situations 
requiring shrinkage on the regression coefficients, particularly when the predictor variables are 
expected to be highly correlated. 
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